Going to try replicating the best classifier we've managed so far using only a single feature, then including more features until the performance changes.


In [1]:
import matplotlib
import matplotlib.pyplot as plt
import numpy as np

In [2]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 6, 4.5
plt.rcParams['axes.grid'] = True
plt.gray()


<matplotlib.figure.Figure at 0x7f8c8b697080>

Loading the data

As before, loading the data, but now there's more of it to load.


In [3]:
cd ..


/home/gavin/repositories/hail-seizure

In [4]:
import train
import json
import imp

In [5]:
settings = json.load(open('SETTINGS.json', 'r'))

In [6]:
settings['FEATURES']


Out[6]:
['ica_feat_var_',
 'ica_feat_cov_',
 'ica_feat_corrcoef_',
 'ica_feat_pib_',
 'ica_feat_xcorr_',
 'ica_feat_psd_',
 'ica_feat_psd_logf_',
 'ica_feat_coher_',
 'ica_feat_coher_logf_',
 'raw_feat_var_',
 'raw_feat_cov_',
 'raw_feat_corrcoef_',
 'raw_feat_pib_',
 'raw_feat_xcorr_',
 'raw_feat_psd_',
 'raw_feat_psd_logf_',
 'raw_feat_coher_',
 'raw_feat_coher_logf_']

In [7]:
data = train.get_data(settings['FEATURES'])

In [9]:
!free -m


             total       used       free     shared    buffers     cached
Mem:         11933      11716        216        455         39       2347
-/+ buffers/cache:       9329       2603
Swap:        12287        690      11597

Developing training script

It's not generally a good idea to run anything that's going to take a long time in an IPython notebook: it can freeze, or lose work if it gets disconnected. Going to develop a script here that can be run locally or on salmon.


In [9]:
# getting a set of the subjects involved
subjects = set(list(data.values())[0].keys())
print(subjects)


{'Patient_1', 'Dog_1', 'Patient_2', 'Dog_4', 'Dog_3', 'Dog_5', 'Dog_2'}

In [10]:
import sklearn.preprocessing
import sklearn.pipeline
import sklearn.ensemble
import sklearn.cross_validation
import sklearn.metrics
from train import utils

Feature selection

We want to do some simple feature selection: even with a massive amount of RAM available there's no point in using features that are obviously useless. The first suggestion for this is a variance threshold, removing features with low variance.
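
For reference, applying that in scikit-learn is a one-liner with VarianceThreshold. A minimal sketch, not run here, assuming a training matrix X like the one built in the next cell (the threshold value is just illustrative):


In [ ]:
import sklearn.feature_selection

# drop features whose variance over the training samples falls below the
# threshold; the default of 0.0 only removes completely constant features
thresh = sklearn.feature_selection.VarianceThreshold(threshold=1e-8)
Xthresh = thresh.fit_transform(X)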


In [13]:
X, y = utils.build_training(list(subjects)[0], list(data.keys()), data)

In [27]:
# distribution of the log10 variance of each feature
h = plt.hist(np.log10(np.var(X, axis=0)))


However, I don't really like this much, as low variance doesn't mean there's no information there. After all, the variance scales with any multiplicative constant applied to the feature.

A better approach is scikit-learn's SelectKBest, which can score features with $\chi^2$ or ANOVA F-values. Can't use $\chi^2$ as it demands non-negative features. Taking the 50 best by F-value and attempting to plot them in 2D with PCA:


In [28]:
import sklearn.feature_selection

In [30]:
Xbest = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif, k=50).fit_transform(X,y)

In [31]:
import sklearn.decomposition

In [40]:
pca = sklearn.decomposition.PCA(n_components=2)
scaler = sklearn.preprocessing.StandardScaler()
twodX = pca.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')


Out[40]:
<matplotlib.collections.PathCollection at 0x7fe929030518>

Looking good, now doing the same with the magical t-SNE:


In [41]:
import sklearn.manifold

In [42]:
tsne = sklearn.manifold.TSNE()
twodX = tsne.fit_transform(scaler.fit_transform(Xbest))
plt.scatter(twodX[:,0][y==1],twodX[:,1][y==1],c='blue')
plt.scatter(twodX[:,0][y==0],twodX[:,1][y==0],c='red')


Out[42]:
<matplotlib.collections.PathCollection at 0x7fe928042d30>

Also looking good. So, all we need to do is add the selection step to our model and turn up n_estimators, and we should get a better prediction.


In [44]:
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k=3000)
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('sel',selection),('scl',scaler),('clf',forest)])

Testing this model on this single subject:


In [45]:
def subjpredictions(subject, model, data):
    """Cross-validated predicted probabilities for a single subject."""
    X, y = utils.build_training(subject, list(data.keys()), data)
    cv = sklearn.cross_validation.StratifiedShuffleSplit(y)
    predictions = []
    labels = []
    allweights = []
    for train_idx, test_idx in cv:
        # upweight the rare preictal class so the classifier isn't
        # rewarded for always predicting interictal
        weight = len(y[train_idx])/sum(y[train_idx])
        weights = np.array([weight if i == 1 else 1 for i in y[train_idx]])
        model.fit(X[train_idx], y[train_idx], clf__sample_weight=weights)
        predictions.append(model.predict_proba(X[test_idx]))
        # matching weights for the test split, returned for weighted scoring
        weight = len(y[test_idx])/sum(y[test_idx])
        weights = np.array([weight if i == 1 else 1 for i in y[test_idx]])
        allweights.append(weights)
        labels.append(y[test_idx])
    # column 1 of predict_proba is the preictal probability
    predictions = np.vstack(predictions)[:, 1]
    labels = np.hstack(labels)
    weights = np.hstack(allweights)
    return predictions, labels, weights
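
The sample weights here scale the preictal examples by len(y)/sum(y), i.e. inversely to how rare they are. A toy example (made-up labels, not from the actual data) of what those weights end up looking like:


In [ ]:
import numpy as np

y_toy = np.array([0, 0, 0, 0, 1])              # 4 interictal, 1 preictal
weight = len(y_toy)/sum(y_toy)                 # 5/1 = 5.0
weights = np.array([weight if i == 1 else 1 for i in y_toy])
print(weights)                                 # [ 1.  1.  1.  1.  5.]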

In [47]:
p,l,w = subjpredictions(list(subjects)[0],model,data)

In [48]:
sklearn.metrics.roc_auc_score(l,p)


Out[48]:
0.96511111111111114

In [49]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)


Out[49]:
[<matplotlib.lines.Line2D at 0x7fe928034518>]

It certainly works a bit better than the classifier I was working with before. What if we increase the number of estimators to deal with the much larger number of features?


In [51]:
model.set_params(clf__n_estimators=5000)


Out[51]:
Pipeline(steps=[('sel', SelectKBest(k=3000, score_func=<function f_classif at 0x7fe939067730>)), ('scl', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', RandomForestClassifier(bootstrap=True, compute_importances=None,
            criterion='gini', max_depth=None, max_features='auto',
            max_leaf_nodes=None, min_density=None, min_samples_leaf=1,
            min_samples_split=2, n_estimators=5000, n_jobs=1,
            oob_score=False, random_state=None, verbose=0))])

In [53]:
%%time
p,l,w = subjpredictions(list(subjects)[0],model,data)


CPU times: user 2min 30s, sys: 36.3 s, total: 3min 6s
Wall time: 3min 6s

In [54]:
sklearn.metrics.roc_auc_score(l,p)


Out[54]:
0.99296296296296305

In [55]:
fpr,tpr,thresholds = sklearn.metrics.roc_curve(l,p)
plt.plot(fpr,tpr)


Out[55]:
[<matplotlib.lines.Line2D at 0x7fe927f4ecc0>]

Actually, it looks like I could just run this in the notebook and it'd probably be fine. Will write the script after doing this.


In [56]:
features = list(data.keys())

In [57]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction


/home/gavin/.local/lib/python3.4/site-packages/sklearn/feature_selection/univariate_selection.py:106: RuntimeWarning: invalid value encountered in true_divide
  f = msb / msw
/home/gavin/.local/lib/python3.4/site-packages/sklearn/feature_selection/univariate_selection.py:106: RuntimeWarning: invalid value encountered in true_divide
  f = msb / msw
CPU times: user 5min 48s, sys: 1min 1s, total: 6min 50s
Wall time: 8min 22s
/home/gavin/.local/lib/python3.4/site-packages/sklearn/feature_selection/univariate_selection.py:106: RuntimeWarning: invalid value encountered in true_divide
  f = msb / msw

In [58]:
import csv

In [59]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])

Submitted this, and got approximately 0.4, worse than predicting all zeros. How did that happen? Something to do with the warnings above, possibly?

Checking if there's anything obviously weird with the output file:


In [60]:
!head output/protosubmission.csv

Nope, looks ok.

There are three warnings above, so this problem is only occurring on three of the subjects. Could run each subject individually and try to find one where the ROC completely falls apart.


In [ ]:
%%time
for s in subjects:
    p,l,w = subjpredictions(s,model,data)
    print("subject %s"%s, sklearn.metrics.roc_auc_score(l,p,sample_weight=w))

The simple solution is apparently to run a variance threshold before the feature selection (like it says on the wiki, really).


In [62]:
thresh = sklearn.feature_selection.VarianceThreshold()
selection = sklearn.feature_selection.SelectKBest(sklearn.feature_selection.f_classif,k=3000)
scaler = sklearn.preprocessing.StandardScaler()
forest = sklearn.ensemble.RandomForestClassifier()
model = sklearn.pipeline.Pipeline([('thr',thresh),('sel',selection),('scl',scaler),('clf',forest)])

In [63]:
%%time
predictiondict = {}
for subj in subjects:
    # training step
    X,y = utils.build_training(subj,features,data)
    # weights
    weight = len(y)/sum(y)
    weights = np.array([weight if i == 1 else 1 for i in y])
    model.fit(X,y,clf__sample_weight=weights)
    # prediction step
    X,segments = utils.build_test(subj,features,data)
    predictions = model.predict_proba(X)
    for segment,prediction in zip(segments,predictions):
        predictiondict[segment] = prediction


CPU times: user 2min 14s, sys: 47 s, total: 3min 1s
Wall time: 4min 58s

In [64]:
with open("output/protosubmission.csv","w") as f:
    c = csv.writer(f)
    c.writerow(['clip','preictal'])
    for seg in predictiondict.keys():
        c.writerow([seg,"%s"%predictiondict[seg][-1]])